WEEK 4: DATA JOINS AND TRANSFORMATIONS

Monday, January 30th

Today we will…

Lab 3: Familiar Words – Sketch it out!

For each demographic group listed below, determine which word(s) in this study was(were) the most and least familiar on average.

Data Layouts

Tidy Data

Image source: R4DS

Untidy “Messy” Data

Illustration by Allison Horst

Our tools!

Illustration by Allison Horst

Collaboration with tidy data.

Illustration by Allison Horst

Pivoting Data

Tidyexpalin animation by Kelsey Gonzalez

Manual Method

We can do the wide-to-long transition manually.

Consider the table of daily rainfall observed in SLO in January 2023. This data is recorded in human-friendly form, in the approximate shape of a calendar. Each week has its own row, and each day has its own column.

Talk to a neighbor about how you would convert this to long format. You may want to open up the spreadsheet containing this table on your computer.

Data source

Manual Method: Steps

  1. Create a new column: Day_of_Week.

  2. Create a Rainfall column to hold the daily rainfall values.

  3. Now we have three columns setup (Week, Day_of_Week, and Rainfall) – start moving data over.

  4. Duplicate repeated data (Week 1-5) and copy Monday over.

  5. Duplicate repeated data (Week 1-5) and copy Tuesday over.

  6. Continue for the rest of the days of the week.

  7. You may want to arrange() the data by week to get chronological order of the rainfall values.

Computational Approach

pivot_longer()

library(readxl)
slo_rainfall <- read_xlsx("2023-rainfall-slo.xlsx")

slo_rainfall_long <- slo_rainfall |> 
  mutate(across(Sunday:Saturday, na_if, y = "NA"),
         across(Sunday:Saturday, as.numeric)
         ) |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
slo_rainfall_long
# A tibble: 35 × 3
    Week Day_of_Week Daily_Rainfall
   <dbl> <chr>                <dbl>
 1     1 Sunday                0   
 2     1 Monday                0.12
 3     1 Tuesday               0   
 4     1 Wednesday             1.58
 5     1 Thursday              0.91
 6     1 Friday                0   
 7     1 Saturday              0.05
 8     2 Sunday                0.27
 9     2 Monday                4.26
10     2 Tuesday               0.43
# … with 25 more rows

Why tidy data?

library(liver)
data(cereal)
str(cereal, hide.attr = TRUE)
'data.frame':   77 obs. of  16 variables:
 $ name    : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ manuf   : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
 $ type    : Factor w/ 2 levels "cold","hot": 1 1 1 1 1 1 1 1 1 1 ...
 $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
 $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
 $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
 $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
 $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
 $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
 $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
 $ potass  : int  280 135 320 330 -1 70 30 100 125 190 ...
 $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
 $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
 $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
 $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
 $ rating  : num  68.4 34 59.4 93.7 34.4 ...
cereal_summary1 <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
cereal_summary1
# A tibble: 3 × 10
  shelf calories protein   fat sodium fiber carbo sugars potass vitamins
  <int>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>
1     1     102.    2.65  0.6    176. 1.68   15.8   4.8    75.5     20  
2     2     110.    1.90  1      146. 0.905  13.6   9.62   57.8     23.8
3     3     108.    2.86  1.25   159. 3.14   14.5   6.53  130.      35.4
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_summary1 |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

cereal_summary2<- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
cereal_summary2
# A tibble: 27 × 3
# Groups:   shelf [3]
   shelf Nutrient mean_amount
   <int> <chr>          <dbl>
 1     1 calories      102.  
 2     1 carbo          15.8 
 3     1 fat             0.6 
 4     1 fiber           1.68
 5     1 potass         75.5 
 6     1 protein         2.65
 7     1 sodium        176.  
 8     1 sugars          4.8 
 9     1 vitamins       20   
10     2 calories      110.  
# … with 17 more rows
cereal_summary2 |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

pivot_wider()

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
mean_protein
# A tibble: 18 × 3
# Groups:   manuf [7]
   manuf shelf mean_protein
   <fct> <int>        <dbl>
 1 A         2         4   
 2 G         1         3   
 3 G         2         1.29
 4 G         3         2.67
 5 K         1         2.75
 6 K         2         2.14
 7 K         3         2.92
 8 N         1         2.67
 9 N         2         2.5 
10 N         3         4   
11 P         1         1.5 
12 P         2         1   
13 P         3         3   
14 Q         1         5   
15 Q         2         2   
16 Q         3         2.5 
17 R         1         2   
18 R         3         3   
protein_wide <- mean_protein |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
protein_wide
# A tibble: 7 × 4
# Groups:   manuf [7]
  manuf   `1`   `2`   `3`
  <fct> <dbl> <dbl> <dbl>
1 G      3     1.29  2.67
2 K      2.75  2.14  2.92
3 N      2.67  2.5   4   
4 P      1.5   1     3   
5 Q      5     2     2.5 
6 R      2    NA     3   
7 A     NA     4    NA   

Better names in pivot_wider()

protein_wide <- mean_protein |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf ")
protein_wide
# A tibble: 7 × 4
# Groups:   manuf [7]
  manuf `Shelf 1` `Shelf 2` `Shelf 3`
  <fct>     <dbl>     <dbl>     <dbl>
1 G          3         1.29      2.67
2 K          2.75      2.14      2.92
3 N          2.67      2.5       4   
4 P          1.5       1         3   
5 Q          5         2         2.5 
6 R          2        NA         3   
7 A         NA         4        NA   

Data Joins

Relational Data

Multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.

IMDb movie relational data

Data Joins

Mutating joins

Adds information from a new dataframe to observations in an existing dataframe

full_join(), left_join(), right_join(), inner_join(), outer_join()

Filtering Joins

Filters observations based on values in new dataframe

semi_join(), anti_join()

Keys

Uniquely identifies an observation in a data set

Relate data sets to each other

inner_join()

Matches pairs of observations when “keys” are equal

Inner Join: IMDb Example

directors_genres
# A tibble: 10 × 3
# Groups:   director_id [5]
   director_id genre      prob
         <dbl> <chr>     <dbl>
 1         429 Adventure 0.75 
 2         429 Fantasy   0.75 
 3        2931 Drama     0.714
 4        2931 Action    0.429
 5       11652 Sci-Fi    0.5  
 6       11652 Action    0.5  
 7       14927 Animation 1    
 8       14927 Family    1    
 9       15092 Comedy    0.545
10       15092 Crime     0.545

Directors: 429, 2931, 9247, 11652, 14927, 15092

movies_directors
# A tibble: 7 × 2
  director_id movie_id
        <dbl>    <dbl>
1         429   300229
2        9247   124110
3       11652    10920
4       11652   333856
5       14927   192017
6       15092   109093
7       15092   237431

Directors: 429, 2931, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors)
# A tibble: 12 × 4
# Groups:   director_id [4]
   director_id genre      prob movie_id
         <dbl> <chr>     <dbl>    <dbl>
 1         429 Adventure 0.75    300229
 2         429 Fantasy   0.75    300229
 3       11652 Sci-Fi    0.5      10920
 4       11652 Sci-Fi    0.5     333856
 5       11652 Action    0.5      10920
 6       11652 Action    0.5     333856
 7       14927 Animation 1       192017
 8       14927 Family    1       192017
 9       15092 Comedy    0.545   109093
10       15092 Comedy    0.545   237431
11       15092 Crime     0.545   109093
12       15092 Crime     0.545   237431

Directors: 429, 2931, 9247, 11652, 14927, 15092

Inner Join: IMDb Example

What if our key variable is not named the same?

directors_genres
# A tibble: 10 × 3
# Groups:   director_id [5]
   director_id genre      prob
         <dbl> <chr>     <dbl>
 1         429 Adventure 0.75 
 2         429 Fantasy   0.75 
 3        2931 Drama     0.714
 4        2931 Action    0.429
 5       11652 Sci-Fi    0.5  
 6       11652 Action    0.5  
 7       14927 Animation 1    
 8       14927 Family    1    
 9       15092 Comedy    0.545
10       15092 Crime     0.545
directors
# A tibble: 5 × 3
     id first_name last_name
  <dbl> <chr>      <chr>    
1   429 Andrew     Adamson  
2  9247 Zach       Braff    
3 11652 James (I)  Cameron  
4 14927 Ron        Clements 
5 15092 Ethan      Coen     
inner_join(directors_genres, 
           directors, 
           by = c("director_id" = "id")
           )
# A tibble: 8 × 5
     id first_name last_name genre      prob
  <dbl> <chr>      <chr>     <chr>     <dbl>
1   429 Andrew     Adamson   Adventure 0.75 
2   429 Andrew     Adamson   Fantasy   0.75 
3 11652 James (I)  Cameron   Sci-Fi    0.5  
4 11652 James (I)  Cameron   Action    0.5  
5 14927 Ron        Clements  Animation 1    
6 14927 Ron        Clements  Family    1    
7 15092 Ethan      Coen      Comedy    0.545
8 15092 Ethan      Coen      Crime     0.545

Mutating Joins

  • left_join() Everything is kept in the data set on the left

  • right_join() Everything is kept in the data set on the right

  • full_join() Everything is kept in both data sets

Mutating Joins

Discuss with a neighbor.

Which of the following directors would be kept in for each of:

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)
directors_genres |> 
  distinct(director_id)
# A tibble: 5 × 1
# Groups:   director_id [5]
  director_id
        <dbl>
1         429
2        2931
3       11652
4       14927
5       15092
movies_directors |> 
  distinct(director_id)
# A tibble: 5 × 1
  director_id
        <dbl>
1         429
2        9247
3       11652
4       14927
5       15092

Filtering Joins: semi_join()

semi_join() Keeping observations

semi_join(directors_genres, movies_directors)
# A tibble: 8 × 3
# Groups:   director_id [4]
  director_id genre      prob
        <dbl> <chr>     <dbl>
1         429 Adventure 0.75 
2         429 Fantasy   0.75 
3       11652 Sci-Fi    0.5  
4       11652 Action    0.5  
5       14927 Animation 1    
6       14927 Family    1    
7       15092 Comedy    0.545
8       15092 Crime     0.545

Movie Directors: 429, 2931, 11652, 14927, 15092

Including observations with %in%

directors_genres |>
  filter(director_id %in% c(429, 9247, 11652, 14927, 15092))
# A tibble: 8 × 3
# Groups:   director_id [4]
  director_id genre      prob
        <dbl> <chr>     <dbl>
1         429 Adventure 0.75 
2         429 Fantasy   0.75 
3       11652 Sci-Fi    0.5  
4       11652 Action    0.5  
5       14927 Animation 1    
6       14927 Family    1    
7       15092 Comedy    0.545
8       15092 Crime     0.545

similar to semi_join()!

Filtering Joins: anti_join()

anti_join() Removing Observations

anti_join(directors_genres, movies_directors)
# A tibble: 2 × 3
# Groups:   director_id [1]
  director_id genre   prob
        <dbl> <chr>  <dbl>
1        2931 Drama  0.714
2        2931 Action 0.429

Movie Directors: 429, 2931, 11652, 14927, 15092

Excluding observations with !%in%

directors_genres |>
  filter(!director_id %in% c(429, 9247, 11652, 14927, 15092))
# A tibble: 2 × 3
# Groups:   director_id [1]
  director_id genre   prob
        <dbl> <chr>  <dbl>
1        2931 Drama  0.714
2        2931 Action 0.429

similar to anti_join()!

A note about piping joins

inner_join(directors_genres, movies_directors)

directors_genres |> 
  inner_join(movies_directors)

PA 4: Military Spending

Today you will be tidying untidy data to explore the relationship between countries of the world and military spending.

Due Wednesday, 2/1 at 8:00am

Bonus Challenge: Murder Mystery in SQL City

For this challenge, you will be using table joins to solve a murder mystery.

Due Sunday, 2/12 at 11:59pm

To do…

  • PA 4: Military Spending
    • Due Wednesday 2/1 at 8:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Sunday 2/12 at 11:59pm

Wednesday, January 25th

Today we will…

  • Review PA 4: Military Spending
  • Housekeeping items…
    • Providing References in Labs + Challenges
    • Clean variable names
    • Lifecycle stages
    • dplyr package updates
    • Saving & Piping Data Joins
  • Extensions to Relational Data
  • Lab 4: Avocado Prices
  • Challenge 3: Avocado Toast Ate My Mortgage

Getting Help and using Chat GPT


Lab 3: Distinct number of words


count(distinct(hiphop_clean, word))


hiphop_clean |> 
  distinct(word) |> 
  count()


\(f(g(h(x)))\)

in piping syntax is

x |> h() |> g() |> f()

n_distinct(hiphop_clean$word)


hiphop_clean |> 
  pull(word) |> 
  n_distinct()

Clean variable names with library(janitor)

Converts all names of variables in a data set to snake_case.

names(military)
 [1] "Country"        "Notes"          "Reporting year" "1988"          
 [5] "1989"           "1990"           "1991"           "1992"          
 [9] "1993"           "1994"           "1995"           "1996"          
[13] "1997"           "1998"           "1999"           "2000"          
[17] "2001"           "2002"           "2003"           "2004"          
[21] "2005"           "2006"           "2007"           "2008"          
[25] "2009"           "2010"           "2011"           "2012"          
[29] "2013"           "2014"           "2015"           "2016"          
[33] "2017"           "2018"           "2019"          
library(janitor)
military_clean_names <- military |> 
  clean_names()

names(military_clean_names)
 [1] "country"        "notes"          "reporting_year" "x1988"         
 [5] "x1989"          "x1990"          "x1991"          "x1992"         
 [9] "x1993"          "x1994"          "x1995"          "x1996"         
[13] "x1997"          "x1998"          "x1999"          "x2000"         
[17] "x2001"          "x2002"          "x2003"          "x2004"         
[21] "x2005"          "x2006"          "x2007"          "x2008"         
[25] "x2009"          "x2010"          "x2011"          "x2012"         
[29] "x2013"          "x2014"          "x2015"          "x2016"         
[33] "x2017"          "x2018"          "x2019"         

Lifceycle stages

Learn more about lifecycle stages of packages, functions, function arguments in R.

Image source: deanattali.com/blog/cranalerts/

Deprecated Functions: Military Spending Example

military_clean |> 
  filter(across(Notes:`2019`, is.na)) |> 
  slice_head(n = 3)
Warning: Using `across()` in `filter()` is deprecated, use `if_any()` or
`if_all()`.
# A tibble: 3 × 35
  Country  Notes Repor…¹ `1988` `1989` `1990` `1991` `1992` `1993` `1994` `1995`
  <chr>    <chr> <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa   <NA>  <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
2 North A… <NA>  <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Sah… <NA>  <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# … with 24 more variables: `1996` <chr>, `1997` <chr>, `1998` <chr>,
#   `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>, `2003` <chr>,
#   `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>, `2008` <chr>,
#   `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>, `2013` <chr>,
#   `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>, `2018` <chr>,
#   `2019` <chr>, and abbreviated variable name ¹​`Reporting year`
military_clean |>
  filter(if_all(Notes:`2019`, ~ is.na(.x))) |> 
  slice_head(n = 3)
# A tibble: 3 × 35
  Country  Notes Repor…¹ `1988` `1989` `1990` `1991` `1992` `1993` `1994` `1995`
  <chr>    <chr> <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa   <NA>  <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
2 North A… <NA>  <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Sah… <NA>  <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
# … with 24 more variables: `1996` <chr>, `1997` <chr>, `1998` <chr>,
#   `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>, `2003` <chr>,
#   `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>, `2008` <chr>,
#   `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>, `2013` <chr>,
#   `2014` <chr>, `2015` <chr>, `2016` <chr>, `2017` <chr>, `2018` <chr>,
#   `2019` <chr>, and abbreviated variable name ¹​`Reporting year`

dplyr updates

There was an update to the dplyr package to Version 1.1.0 on Sunday 1/29/2023.

See changelog for updates

group_by() vs .by = argument

across()

military |> 
  mutate(across(`1988`:`2019`, na_if, y = ". ."))


military |> 
  mutate(across(`1988`:`2019`, ~ na_if(.x, y = ". .")))

join_by()

Saving & Piping data joins

genres_movies_joined <- inner_join(directors_genres, movies_directors)

genres_movies_joined <- directors_genres |> 
  inner_join(movies_directors)

Extensions to Relational Data

IMDb Movies Data

Multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.

Keys:

  • Uniquely identifies an observation in a data set
  • Relate data sets to each other

Joining Multiple Data Sets

movies_directors
# A tibble: 41 × 2
   director_id movie_id
         <dbl>    <dbl>
 1         429   300229
 2        2931   254943
 3        9247   124110
 4       11652    10920
 5       11652   333856
 6       14927   192017
 7       15092   109093
 8       15092   237431
 9       15093   109093
10       15093   237431
# … with 31 more rows
directors
# A tibble: 34 × 3
      id first_name   last_name
   <dbl> <chr>        <chr>    
 1   429 Andrew       Adamson  
 2  2931 Darren       Aronofsky
 3  9247 Zach         Braff    
 4 11652 James (I)    Cameron  
 5 14927 Ron          Clements 
 6 15092 Ethan        Coen     
 7 15093 Joel         Coen     
 8 15901 Francis Ford Coppola  
 9 15906 Sofia        Coppola  
10 16816 Cameron      Crowe    
# … with 24 more rows
movies
# A tibble: 36 × 4
       id name             year  rank
    <dbl> <chr>           <dbl> <dbl>
 1  10920 Aliens           1986  8.20
 2  17173 Animal House     1978  7.5 
 3  18979 Apollo 13        1995  7.5 
 4  30959 Batman Begins    2005 NA   
 5  46169 Braveheart       1995  8.30
 6 109093 Fargo            1996  8.20
 7 111813 Few Good Men, A  1992  7.5 
 8 112290 Fight Club       1999  8.5 
 9 116907 Footloose        1984  5.80
10 124110 Garden State     2004  8.30
# … with 26 more rows

join_one_data <- movies_directors |> 
  inner_join(directors, 
             by = c("director_id" = "id")
             )
join_one_data
# A tibble: 41 × 4
   director_id movie_id first_name last_name
         <dbl>    <dbl> <chr>      <chr>    
 1         429   300229 Andrew     Adamson  
 2        2931   254943 Darren     Aronofsky
 3        9247   124110 Zach       Braff    
 4       11652    10920 James (I)  Cameron  
 5       11652   333856 James (I)  Cameron  
 6       14927   192017 Ron        Clements 
 7       15092   109093 Ethan      Coen     
 8       15092   237431 Ethan      Coen     
 9       15093   109093 Joel       Coen     
10       15093   237431 Joel       Coen     
# … with 31 more rows

If you update dplyr

directors_movie_ids <- movies_directors |> 
  inner_join(directors, 
             by = join_by(id == director_id)
             )
join_two_data <- movies_directors |> 
  inner_join(directors, 
             by = c("director_id" = "id")
             ) |> 
  inner_join(movies,
             by = c("movie_id" = "id")
             ) |> 
  rename(movie_name = name)
join_two_data
# A tibble: 41 × 7
   director_id movie_id first_name last_name movie_name               year  rank
         <dbl>    <dbl> <chr>      <chr>     <chr>                   <dbl> <dbl>
 1         429   300229 Andrew     Adamson   Shrek                    2001  8.10
 2        2931   254943 Darren     Aronofsky Pi                       1998  7.5 
 3        9247   124110 Zach       Braff     Garden State             2004  8.30
 4       11652    10920 James (I)  Cameron   Aliens                   1986  8.20
 5       11652   333856 James (I)  Cameron   Titanic                  1997  6.90
 6       14927   192017 Ron        Clements  Little Mermaid, The      1989  7.30
 7       15092   109093 Ethan      Coen      Fargo                    1996  8.20
 8       15092   237431 Ethan      Coen      O Brother, Where Art T…  2000  7.80
 9       15093   109093 Joel       Coen      Fargo                    1996  8.20
10       15093   237431 Joel       Coen      O Brother, Where Art T…  2000  7.80
# … with 31 more rows

Joining on Multiple Variables

Using the hiphop data from Lab 3…

music
# A tibble: 10,752 × 6
   word          subj   folk  rock country   pop
   <chr>         <chr> <dbl> <dbl>   <dbl> <dbl>
 1 [to be] ghost p15       0     1       3     4
 2 [to be] ghost p53       0     0       1     3
 3 [to be] ghost p136      0     0       0     0
 4 [to be] ghost p36       1     1       2     2
 5 [to be] ghost p118      0     1       0     3
 6 [to be] ghost p106      0     0       1     1
 7 [to be] ghost p70       2     0       0     0
 8 [to be] ghost p107      0     1       3     1
 9 [to be] ghost p102      0     0       0     0
10 [to be] ghost p83       0     4       4     4
# … with 10,742 more rows
familiarity
# A tibble: 10,752 × 3
   word          participant familiarity
   <chr>         <chr>             <dbl>
 1 [to be] ghost p15                   1
 2 [to be] ghost p53                   1
 3 [to be] ghost p136                  1
 4 [to be] ghost p36                   1
 5 [to be] ghost p118                  1
 6 [to be] ghost p106                  1
 7 [to be] ghost p70                   1
 8 [to be] ghost p107                  1
 9 [to be] ghost p102                  5
10 [to be] ghost p83                   1
# … with 10,742 more rows
music_wordfam <- music |> 
  full_join(familiarity,
            by = c("word" = "word", 
                   "subj" = "participant")
            )
music_wordfam 
# A tibble: 10,752 × 7
   word          subj   folk  rock country   pop familiarity
   <chr>         <chr> <dbl> <dbl>   <dbl> <dbl>       <dbl>
 1 [to be] ghost p15       0     1       3     4           1
 2 [to be] ghost p53       0     0       1     3           1
 3 [to be] ghost p136      0     0       0     0           1
 4 [to be] ghost p36       1     1       2     2           1
 5 [to be] ghost p118      0     1       0     3           1
 6 [to be] ghost p106      0     0       1     1           1
 7 [to be] ghost p70       2     0       0     0           1
 8 [to be] ghost p107      0     1       3     1           1
 9 [to be] ghost p102      0     0       0     0           5
10 [to be] ghost p83       0     4       4     4           1
# … with 10,742 more rows

if dplyr updated…

music_wordfam <- music |> 
  full_join(familiarity,
            by = join_by(word == word, 
                         subj == participant)
            )
music_wordfam 

Lab + Challenge

Lab 4: Avocado Prices + Challenge 4: Avocado Toast Ate My Mortgage


Handy Helpers

rename() – Change names of columns

separate() – Separate values of a variable


Filtering Joins

semi_join(): Keeps values found in another data set

anti_join(): Keeps values not found in another data set

Workflow

  1. Load packages + read in original data
library(tidyverse)
data_original <- read_csv(file = "path/to/datal.csv")
  1. Clean data – save your changes! This is now your new “master” data set
data_clean <- data_original |> 
  mutate(across(x1:x5, ~ as.factor(.x))) |> 
  mutate(new_var <- if_else(...))
  1. If you need subsets, create those from your new “master” data for the specific tasks.
demographics_subj <- data_clean |> 
  distinct(subj, keep_all = TRUE)
  1. Output only the the information you want to include in your assignment.

To do…

  • Lab 4: Avocado Prices
    • Due Friday, 2/3 at 11:59pm
  • Challenge 4: Avocado Toast Ate My Mortgage
    • Due Saturday, 2/4 at 11:59pm
  • Read Chapter 5: Special Data Types
    • Concept Check 5.1 + 5.2 + 5.3 due Monday (2/6) at 8:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Sunday 2/12 at 11:59pm